-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support UTF-8 characters in feature name again #2976
Conversation
This commit reverts 0d59859. Also see: - microsoft#2226 - microsoft#2478 - microsoft#2229 I reproduced the issue and as @kidotaka gave us a great survey in microsoft#2226, I don't conclude that the cause is UTF-8, but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (microsoft#2229)" whose commit hash is 0d59859, and add support feture names as UTF-8 again.
@henry0312 did you test for this? LightGBM/src/boosting/gbdt_model_text.cpp Line 481 in e6de39a
you can test that by save model to string/file, and load it back. |
@guolinke I just tried and confirm that there is no problem. import lightgbm
import numpy
from matplotlib import pyplot
numpy.random.seed(42)
train_x= numpy.random.normal(size=(1000, 4))
valid_x= numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)
train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)
# This has non-ascii strings.
feature_names = ['F_零', 'F_一', 'F_二', 'F_三']
params = {
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': {'l2', 'l1'},
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.9,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'verbose': 0,
'seed': 42,
}
print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
train_lgb,
num_boost_round=20,
valid_sets=valid_lgb,
feature_name=feature_names,
evals_result=evals_result,
early_stopping_rounds=5)
# feature names
print('Feature names:', gbm.feature_name())
print('Saving model...')
# save model to file
gbm.save_model('model.txt')
print('Loading model to predict...')
# load model to predict
gbm2 = lightgbm.Booster(model_file='model.txt')
# feature names
print('Feature names:', gbm2.feature_name()) ❯ python test_utf8.py
Starting training...
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000196 seconds.
You can set `force_col_wise=true` to remove the overhead.
[1] valid_0's l1: 0.234781 valid_0's l2: 0.0734969
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] valid_0's l1: 0.235442 valid_0's l2: 0.0736321
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] valid_0's l1: 0.236002 valid_0's l2: 0.0738012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] valid_0's l1: 0.236452 valid_0's l2: 0.0738955
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] valid_0's l1: 0.236552 valid_0's l2: 0.0739692
[6] valid_0's l1: 0.236756 valid_0's l2: 0.0739201
Early stopping, best iteration is:
[1] valid_0's l1: 0.234781 valid_0's l2: 0.0734969
Feature names: ['F_零', 'F_一', 'F_二', 'F_三']
Saving model...
Loading model to predict...
Feature names: ['F_零', 'F_一', 'F_二', 'F_三'] EDIT: 2020-04-06 19:07 |
great! I think we can add more test cases for utf8 feature names. |
@guolinke Sure! Please wait a moment. |
I think we should revisit R-package as well to ensure it supports non-ASCII names. |
As I remember, we hadn't had any problems with feature names before 0d59859. |
By the way, we must drop support of Python 2.x as soon as possible. |
@henry0312 I'm surprised by the R test errors you got, but they don't seem related to this PR and the fixes are ok with me generally (I'll leave a comment) I'll add some test code for R here in a few minutes |
I'm looking at the logs and really don't understand these errors, or what is different between the CI environment and my laptop. I'm running the tests from a shell so it's not like my local environment is benefitting from weird RStudio magic with the environment. The new test is failing But so are 8 others |
|
json library of R may encode strings, although I don't know much about R. |
I think functions around https://rlang.r-lib.org/reference/chr_unserialise_unicode.html are related. |
If it is not trivial to enable UTF-8 support in R, I think we can split Python and R changes in separate PRs. |
Do we need to report the current issue somewhere, or they are aware as it is "every year" problem? |
Finally, all tests have passed (but include some work-arounds, which are not related to Python😅)
No need, but we will continue to check the thread. |
@henry0312 I extracted the workarounds not related to this PR in #2977 and just merged it into |
Followed the current master/HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the following decode
s?
LightGBM/python-package/lightgbm/basic.py
Line 2585 in 91185c3
ret = json.loads(string_buffer.value.decode()) |
LightGBM/python-package/lightgbm/basic.py
Lines 2956 to 2957 in 91185c3
self.__name_inner_eval = \ | |
[string_buffers[i].value.decode() for i in range_(self.__num_inner_eval)] |
Looks ok to me from R side! But my approval shouldn't count towards a merge. |
@StrikerRUS yeah, your points are right.
@jameslamb can you approve again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one minor comment!
This commit reverts 0d59859.
Also see:
I reproduced the issue and as @kidotaka gave us a great survey in #2226,
I don't conclude that the cause is UTF-8, but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)" whose commit hash
is 0d59859, and add support feature names as UTF-8 again.
Sample codes
reprodude the issue
The below code raises
lightgbm.basic.LightGBMError: Wrong size of feature_names
use utf-8 characters
You can see there is no problem with using utf-8 in feature names.